ABSTRACT

We introduce SoFiA, a flexible software application for the detection and parameterization 
of sources in 3D spectral-line datasets. SoFiA combines for the first time in a single piece of 
software a set of new source-finding and parameterization algorithms developed on the way to 
future HI surveys with ASKAP (WALLABY, DINGO) and APERTIF. It is designed to enable 
the general use of these new algorithms by the community on a broad range of datasets. The 
key advantages of SoFiA are the ability to: search for line emission on multiple scales to detect
3D sources in a complete and reliable way, taking into account noise level variations and
the presence of artefacts in a data cube; estimate the reliability of individual detections; look 
for signal in arbitrarily large data cubes using a catalogue of 3D coordinates as a prior; provide 
a wide range of source parameters and output products which facilitate further analysis by the 
user. We highlight the modularity of SoFiA, which makes it a flexible package allowing users 
to select and apply only the algorithms useful for their data and science questions. This modularity
also makes it possible to easily expand SoFiA in order to include additional methods
as they become available. The full SoFiA distribution, including a dedicated graphical user 
interface, is publicly available for download. 

Key words: methods: data analysis. 


1 INTRODUCTION 

The detection of astronomical signal above instrumental noise is a
crucial aspect of all astronomical observations. The techniques employed
to detect and characterise this signal depend on the type of
data being analysed (see Masias et al. 2012 for a review). Standard
methods and tools have emerged in fields with a large community
base such as 2D imaging (e.g., SExtractor; Bertin & Arnouts
1996) and 1D spectroscopy (e.g., GANDALF; Sarzi et al. 2006). In
other fields with relatively fewer users, detection algorithms vary
significantly between projects. This is the case for studies based on
3D spectral line data (for brevity, data cubes), where the flux of
a spectral line is mapped as a function of position on the sky and
line-of-sight velocity of the emitting matter.

The diversity of source finding methods for data cubes is at
least partly due to the diversity of 3D structure of the sources being
studied. We illustrate this point in Fig. 1, where we show a data
cube of the ATLAS3D HI survey (Serra et al. 2012). In this figure,
the central object is bright (and therefore easy to detect) but has
a complex 3D structure, including a low surface brightness extension
towards large RA. In contrast, the top object is bright and
relatively simple as emission is confined within a small range of
RA and Dec. Finally, the bottom object is the typical case of a resolved,
edge-on galaxy where the two peaks of the double-horn velocity
profile are clearly visible, and detection of the faint emission
between the two peaks is challenging. An ideal 3D source finder
should be able to detect and parameterize all these different sources
in a complete and reliable way.


Figure 1. Volume rendering of an HI data cube showing that individual
sources have a complex and diverse 3D structure. This makes their detection
and accurate parameterization challenging.

Radio single dishes and interferometers have traditionally
been the most common telescopes used to construct data cubes (although
optical integral-field spectrographs are now also generating
large numbers of such cubes; e.g., Cappellari et al. 2011; Croom
et al. 2012; Sanchez et al. 2012). The upgrade and continuing operation
of existing radio telescopes, as well as the construction of
the Square Kilometre Array and its precursors, are leading to a
rapid increase in the number and size of data cubes. Standard and
sufficiently general source-finding tools will be necessary to analyse
these data, and recent work has started addressing this need
(see, e.g., Duchamp by Whiting 2012). In this paper we introduce
SoFiA, a new, flexible Source Finding Application for data cubes
which combines detection algorithms and techniques from several
source finders.

SoFiA is designed to work on any data cube independent
of telescope or observed spectral line. However, its development
is part of preparatory work for a few specific, upcoming HI surveys:
WALLABY, a blind HI survey of 3/4 of the entire sky out
to z ~ 0.25 to be carried out with the Australian Square Kilometre
Array Pathfinder (ASKAP; see Koribalski 2012a); DINGO, a deep
HI survey out to z ~ 0.4 (also to be carried out with ASKAP; see
Meyer 2009); and the HI surveys planned for APERTIF (Verheijen
et al. 2009). This preparatory work has resulted in the development
of a number of new source-finding algorithms, which are described
in a series of papers referred to in the next Section (for a summary
see Koribalski 2012b). SoFiA puts these different algorithms together
for the first time in a coherent, flexible and publicly available
piece of software.

SoFiA can be obtained from
https://github.com/SoFiA-Admin/SoFiA. On the same webpage we provide a list
of requirements, installation instructions and a user manual. The
aim of this paper is to describe how SoFiA operates on data cubes
and thereby provide a reference for current and future users.


2 DESCRIPTION OF SoFiA 

SoFiA is a modular application whose aim is to detect and parameterize
sources in a data cube. The flowchart in Fig. 2 shows the
various modules that users can choose to use (or not to use), in the
order in which they are executed by SoFiA. Once an input data
cube (or a sub-cube selected by the user) is loaded, these modules
allow users to:

• modify the input cube by applying flags, weights, or a set of
filters;

• detect the spectral line signal;

• identify sources by merging detected voxels together;

• reject false detections;

• optimize the mask of individual sources;

• measure source parameters;

• filter the output by selecting a region of interest in source parameter
space;

• and produce output catalogues as well as cubes, moment
maps, position-velocity diagrams and integrated spectra.

Figure 2. SoFiA flowchart. We highlight the “Filter output” module with
a dashed box as this will become available in future releases of SoFiA.

Figure 3. Screenshot of the SoFiA GUI. The GUI automatically adopts
the native style of the window manager used on the system where SoFiA
is installed. In this figure we show the GUI as it appears on a Kubuntu Linux
system. The GUI also offers the option of displaying the source catalogue
generated by SoFiA and includes a help browser that explains the available
parameter settings.


Individual modules are described in more detail in the rest of this
Section. They are written in either Python or C++ and rely on a
range of external libraries, including NumPy and SciPy (Jones et al.
2001; Walt, Colbert & Varoquaux 2011), Cython (Behnel et al.
2011), Astropy (Astropy Collaboration et al. 2013), the GNU Scientific
Library and, optionally, matplotlib (Hunter 2007). Provided
that these libraries are available, SoFiA can run on all machines
with a Unix or Linux operating system (including, e.g., Mac OS X
and Ubuntu). We refer to the SoFiA webpage for up-to-date details.


SoFiA can be executed from the command line or using a
dedicated graphical user interface (GUI) based on the Qt library
(see Fig. 3). Both methods allow users to select which combination
of the above modules and which source finding and parameterization
algorithms to use. This selection is done using either the GUI
or a plain text parameter file (if running SoFiA from the command
line), allowing the source-finding strategy and its complexity
to be optimized for the type of data and sources of interest. For
example, SoFiA could be asked simply to create
a moment-0 image of all voxels above a given threshold in a data
cube, in which case most of SoFiA’s functionalities would be
switched off. Alternatively, it could be given a number of relatively
more complex tasks such as, for example, applying a wavelet filtering
algorithm, rejecting false detections or fitting models to the
spectrum of the detected sources.

While SoFiA will continue to be improved, this basic principle
of modularity will not change. Therefore, although this paper
describes the software as it is at the time of writing and new algorithms
may be introduced in the future, the main workings of
SoFiA will remain as illustrated here.


2.1 Data cube, weights cube, mask cubes and filters 

Four different types of input and/or output cubes are relevant at 
different stages of SoFiA. 

• Data cube, which includes signal from astronomical sources 
superimposed on instrumental noise (and errors). 

• Weights cube, which allows users to weight voxel values to 
take into account, e.g., noise level variations across the cube or the 
presence of imaging artefacts in certain regions of the data cube. 

• Binary mask, where detected and non-detected voxels have 
values of 1 and 0, respectively. 

• Object mask, where non-detected voxels have a value of 0 and 
detected voxels have an integer value corresponding to the ID of 
the object they belong to. 

All source-finding algorithms implemented in SoFiA and described
in Sec. 2.2 below assume that the noise level is uniform
across the data cube. Therefore, noise variations caused by, e.g.,
mosaicking or frequency-dependent flagging need to be removed
first. This can be done within SoFiA by means of a weights cube
inversely proportional to the noise level. SoFiA removes noise
variations by multiplying the data cube by the weights cube. Once
source detection is completed, SoFiA will undo this operation before
measuring source parameters. The weights cube could also be
useful to down-weight regions of a data cube affected by imaging
artefacts (e.g., cleaning or continuum-subtraction residuals).

The weights cube can be provided by the user. Alternatively, 
users can provide an analytic description of the weights variation 
across the cube. Finally, a weights cube inversely proportional to 
the local noise level can be derived by SoFiA and applied to the 
data cube. The evaluation of the local noise level is carried out independently
along any or all of the three axes of the data cube.
For example, a user may wish to remove noise variations along the 
frequency axis alone, under the assumption that the noise does not 
vary within each frequency plane. 
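
As an illustration of the frequency-axis case just described, the following is a minimal NumPy sketch and not SoFiA's own code. The function name `channel_weights`, the (z, y, x) axis order and the use of the median absolute deviation as the per-channel noise estimate are assumptions made for this example.

```python
import numpy as np

def channel_weights(cube):
    """Return weights of shape (nz, 1, 1), inversely proportional to the
    noise measured in each frequency channel of a (z, y, x) cube."""
    # Robust noise estimate per channel: median absolute deviation,
    # scaled by 1.4826 so that it matches sigma for Gaussian noise.
    med = np.median(cube, axis=(1, 2), keepdims=True)
    mad = np.median(np.abs(cube - med), axis=(1, 2), keepdims=True)
    return 1.0 / (1.4826 * mad)

# Synthetic cube whose noise level rises from 1 to 3 across the band.
rng = np.random.default_rng(0)
true_sigma = np.linspace(1.0, 3.0, 20)[:, None, None]
cube = rng.normal(size=(20, 64, 64)) * true_sigma

weights = channel_weights(cube)
normalised = cube * weights      # uniform noise: source finding would run here
restored = normalised / weights  # undo before measuring source parameters
```

The weights broadcast along the two spatial axes, which encodes the assumption that the noise does not vary within a frequency plane.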

We note that SoFiA measures the noise within a data cube at
various other stages of the processing. Different methods of noise
measurement are implemented and users can decide which one is
more appropriate for their purpose. Possible choices are: i) standard
deviation; ii) median absolute deviation; and iii) standard deviation
of a zero-centred Gaussian fit to the negative side of the flux histogram.
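
The three noise estimators listed above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used in SoFiA: the function names are hypothetical, the factor 1.4826 converts a median absolute deviation into a Gaussian sigma, and the Gaussian fit is done here as a weighted straight-line fit to the logarithm of the histogram counts, chosen only to keep the example short.

```python
import numpy as np

def noise_std(data):
    """i) Plain standard deviation (sensitive to bright emission)."""
    return float(np.std(data))

def noise_mad(data):
    """ii) Median absolute deviation, scaled to a Gaussian sigma."""
    med = np.median(data)
    return float(1.4826 * np.median(np.abs(data - med)))

def noise_negative_gauss(data, nbins=50):
    """iii) Sigma of a zero-centred Gaussian fitted to the negative
    side of the flux histogram."""
    counts, edges = np.histogram(data[data < 0], bins=nbins)
    centres = 0.5 * (edges[:-1] + edges[1:])
    good = counts > 0
    # For a zero-centred Gaussian, ln(counts) = ln(A) - x^2 / (2 sigma^2),
    # i.e. a straight line in x^2. Weights follow Poisson statistics.
    slope, _ = np.polyfit(centres[good] ** 2, np.log(counts[good]), 1,
                          w=np.sqrt(counts[good]))
    return float(np.sqrt(-0.5 / slope))
```

On data containing bright positive emission, the first estimator is biased high, while the second and third remain close to the true noise level because they are insensitive to, or exclude, the positive tail.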

The calculation and application of the inverse-noise weights
cube described above is part of a more general SoFiA module
which allows users to apply a filter to the data cube before running
the selected source-finding and parameterization algorithms.
As indicated in Fig. 2, this module includes two additional filtering
methods: firstly, the convolution with a 3D kernel whose shape can
be chosen among a few options and whose size can be specified
by the user; and secondly, the 2D-1D wavelet de-noising algorithm
developed by Flöer & Winkel (2012). This algorithm processes the
two spatial dimensions and the spectral dimension of the data cube
separately, and returns a noise-free data cube reconstructed using
only wavelet coefficients above a specified threshold. Additional
filtering options may be provided in future releases.

As indicated by the flowchart in Fig. 2, portions of the cube
can be blanked out (flagged) prior to source finding. This may be
necessary at the location of very bright continuum sources whose
spectrum was not subtracted properly from the data, or at channels
dominated by line emission from the Galaxy or affected by strong
radio frequency interference.

Finally, mask cubes are generally calculated within SoFiA 
(see below) but can also be provided by the user. The latter could 
be desirable if a user, following an initial source-finding run, wishes 
to look for additional sources with a different search algorithm or 
parameters. In this case the new sources are added to the initial, 
input mask. Alternatively, an input mask could be used if sources 
have already been identified and only subsequent parameterization 
steps are required. 

2.2 Detection of spectral line signal 

SoFiA is meant to offer a number of detection algorithms that
users can choose from. A common advantage of these algorithms
is the ability to look for emission on multiple scales, which is essential
to detect sources in 3D (see Fig. 1). An exception is the simple
threshold method (see below), unless used in combination with
some of the filtering methods described above (e.g., 2D-1D wavelet
de-noising). The following algorithms are implemented in SoFiA.

• Simple threshold. This is the simplest possible algorithm (and 
the only one not operating on multiple scales): only voxels whose 
absolute value is above a specified threshold are detected. Users 
can specify the threshold in flux units or relative to the noise level. 

• S+C. This is the smooth + clip algorithm developed by Serra
et al. (2012) on the basis of techniques traditionally used within
the HI community. It consists of searching for emission at multiple
angular and velocity resolutions by smoothing the data cube
with 3D kernels specified by the user. At each resolution, voxels
are detected if their absolute value is above a threshold given by
the user (in noise units). The final mask is the union of the masks
constructed at the various resolutions.

• CNHI. This algorithm was developed by Jurek (2012). Individual
1D spectra (or bundles of adjacent spectra) are extracted
from the data cube. For each of them, the Kuiper test is used to
identify regions of the spectrum which are not consistent with containing
only noise. In practice, users need to provide a probability
threshold above which a spectral region is considered detected and
is added to the final binary mask.
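
The smooth + clip procedure described in the list above can be illustrated with a short sketch. This is not the S+C implementation shipped with SoFiA: the function name `smooth_and_clip`, the use of Gaussian kernels along all three axes and the noise estimate based on the median absolute deviation are assumptions made for this example.

```python
import numpy as np
from scipy import ndimage

def smooth_and_clip(cube, kernels, threshold=3.5):
    """Return the union of the masks obtained at several resolutions.

    cube      : 3D array with axes (z, y, x)
    kernels   : list of (sigma_z, sigma_y, sigma_x) Gaussian widths in
                voxels; (0, 0, 0) means the original resolution
    threshold : detection threshold in units of the noise level
    """
    mask = np.zeros(cube.shape, dtype=bool)
    for sigmas in kernels:
        smoothed = ndimage.gaussian_filter(cube, sigma=sigmas)
        # The noise is re-measured after each smoothing operation.
        med = np.median(smoothed)
        noise = 1.4826 * np.median(np.abs(smoothed - med))
        # Absolute values are clipped, so negative peaks enter the mask too.
        mask |= np.abs(smoothed) > threshold * noise
    return mask
```

Because absolute values are clipped, the resulting mask contains both positive and negative noise peaks, which is what the reliability calculation of Sec. 2.4 requires. Passing only the kernel (0, 0, 0) reduces the function to the simple threshold finder.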

The numerous possible combinations of these source-finding methods
together with the filtering algorithms described in Sec. 2.1 allow
users to design a number of different strategies to detect signal
in their data cube. For example, the CNHI finder could be run following
convolution with a 3D kernel appropriate for the type of
sources being searched. Alternatively, a simple threshold method
could be used after the noise has been removed from the cube by
the 2D-1D wavelet filter.

Popping et al. (2012) discuss strengths and weaknesses of
these algorithms and compare their performance. An important recommendation
of that work is that all source finders should incorporate
some form of 3D smoothing in order to increase completeness.
In this respect, the simple threshold algorithm is of limited use unless
coupled with a filtering method such as the 2D-1D wavelet
de-noising. Popping et al. (2012) find this particular combination
to deliver higher completeness and reliability than S+C and CNHI
for sources unresolved on the sky, especially at narrow line widths.
In contrast, the S+C method is by construction well suited to finding
sources on a variety of scales and Popping et al. (2012) deem it
the best choice for extended objects.

We note that many of the algorithms have been improved since
the comparative study of Popping et al. (2012). Additional testing
can now be carried out within SoFiA and will be used to investigate
how to further improve their performance. Until then, we refer
to the aforementioned papers for a complete discussion of these
methods.

All the above algorithms return a binary mask of detected voxels
(and any additional source-finding algorithm could be added to
SoFiA as long as it satisfies this condition). As an example, the
top panels of Fig. 4 show five channels extracted from the data cube
in Fig. 1 and, with black contours, the regions included in the binary
mask. In this case the S+C finder was employed using 12 different
smoothing kernels. The relatively low adopted threshold (3.5σ) results
in a number of noise peaks being included in the mask. We
come back to this point in Sec. 2.4.

2.3 Merging detected voxels into sources 

The aforementioned binary mask is the basis for identifying individual
sources or objects. In SoFiA, this computationally expensive
operation is performed using the C++ implementation of the
Lutz (1980) one-pass algorithm by Jurek (2012), combined with a
sparse representation of 3D objects. We refer to Jurek (2012) for
details on this implementation. Here it is sufficient to say that this
algorithm produces the same result as a friends-of-friends method
with linking element equal to an elliptic cylinder. Users can specify
the cylinder size. This step of SoFiA also returns basic source parameters
such as total flux, peak flux (both normalised by the noise
level) and size.
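
To make the friends-of-friends picture concrete, the sketch below links detected voxels with a cylindrical linking element and labels the resulting groups. It is a brute-force illustration suitable only for small masks, not the one-pass Lutz algorithm used by SoFiA. The function name `merge_voxels`, the circular (rather than elliptic) cross-section and the half-height parameter are assumptions made for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def merge_voxels(binary_mask, radius_xy=3.0, half_height_z=3):
    """Turn a binary mask with axes (z, y, x) into an object mask.

    Two detected voxels are linked if they are separated by at most
    `radius_xy` pixels on the sky and `half_height_z` channels along
    the spectral axis. Linked groups receive a common integer ID >= 1.
    """
    coords = np.argwhere(binary_mask)            # shape (N, 3)
    object_mask = np.zeros(binary_mask.shape, dtype=np.int32)
    if len(coords) == 0:
        return object_mask
    # Pairwise separations: O(N^2) memory, acceptable for a demonstration.
    d = coords[:, None, :] - coords[None, :, :]
    linked = (np.abs(d[..., 0]) <= half_height_z) & \
             (d[..., 1] ** 2 + d[..., 2] ** 2 <= radius_xy ** 2)
    _, labels = connected_components(csr_matrix(linked.astype(np.int8)),
                                     directed=False)
    object_mask[tuple(coords.T)] = labels + 1    # 0 is kept for "no object"
    return object_mask
```

The output follows the object mask convention of Sec. 2.1: non-detected voxels are 0 and detected voxels carry the ID of the object they belong to.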

The bottom panels of Fig. 4 show the objects created from the
binary mask using a merging cylinder with a radius of 3 pixels and
a height of 7 channels (we show only objects with positive total
flux). These panels show four real detections as well as a number
of positive noise-peak objects. It is worth highlighting the successful
detection of a faint, extended HI tail east of the brightest galaxy
(second panel from the left). This detection is made possible by
the fact that SoFiA looks for emission on multiple scales. Furthermore,
SoFiA correctly identifies as a single source the resolved,
edge-on galaxy located in the southern part of the cube (visible in
all panels but the first) despite the low level emission at channels
close to the systemic velocity.

2.4 Reliability and rejection of false detections 

All detection algorithms listed above require users to specify a detection
threshold. The closer this threshold is to the noise, the more
noise peaks will be included in the resulting binary mask. Some
of these noise peaks may be identified as separate objects if they
are sufficiently far from a real object (see bottom panels of Fig. 4).
SoFiA offers two ways of removing these false detections from
the final output.

The first method is a simple size filter and is based on the fact
that all real detections are at least as large as the data cube’s resolution.
In practice, users can specify the minimum acceptable source
size along each axis of the cube independently. The downside of
this method is that it may potentially remove relatively bright but
unresolved sources from the final object mask.
Figure 4. Illustration of the detection of signal and identification of individual sources in SoFiA. Top panels. Channel maps extracted from the data cube
shown in Fig. 1. The line-of-sight velocity of each channel is indicated in the top-left corner (note that these are not adjacent channels in the original cube).
The beam is shown in the bottom-left corner. Black contours show regions included in the binary mask (Sec. 2.2). Bottom panels. Same channel maps as in
the top panels but now showing the individual objects formed on the basis of the binary mask (Sec. 2.3). We show only objects with positive total flux. Each
object is indicated with a different random colour. Black contours indicate the four objects whose reliability is higher than 99 per cent (Sec. 2.4).


The second method is illustrated by Serra, Jurek & Flöer
(2012) and estimates the reliability of individual objects by comparing
the distribution of positive and negative sources (i.e., sources
with positive and negative total flux, respectively) in parameter
space. The simple idea is that the distribution of positive and negative
noise peaks should be identical while positive, real detections
should not have a negative counterpart in parameter space. It is
based on the assumptions that the noise is symmetric and that real
sources have positive total flux (i.e., absorption line sources have
been masked).
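
One way to turn this idea into a number is sketched below. It is an assumption of this example, not a statement of the method as implemented in SoFiA, that the reliability at the location of a positive source is estimated as (P - N) / P, where P and N are the local number densities of positive and negative sources obtained with a Gaussian kernel density estimate. The function name `reliability` is hypothetical, and the parameters of the negative sources are assumed to have already been mapped to the same space as the positive ones (e.g., by taking absolute values of fluxes).

```python
import numpy as np
from scipy.stats import gaussian_kde

def reliability(pos_params, neg_params):
    """Estimate the reliability of each positive source.

    pos_params, neg_params : arrays of shape (n_params, n_sources) holding
    the source parameters (e.g., log total flux and log peak flux).
    Returns an array of length n_sources(pos) with values in [0, 1].
    """
    n_pos = pos_params.shape[1]
    n_neg = neg_params.shape[1]
    # Number densities of positive and negative sources, both evaluated
    # at the positions of the positive sources in parameter space.
    p = gaussian_kde(pos_params)(pos_params) * n_pos
    n = gaussian_kde(neg_params)(pos_params) * n_neg
    # Where negative sources are as common as positive ones the value
    # is close to 0; where they are absent it approaches 1.
    return np.clip((p - n) / p, 0.0, 1.0)
```

The estimate is only meaningful when enough negative sources are available to sample their distribution in parameter space.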

Within SoFiA, the reliability can be calculated following the
run of any source-finding algorithm chosen by the user as long as
both positive and negative noise peaks are included in the binary
mask, and after the detected voxels are merged into sources (Sec.
2.3). The reliability calculation also requires that a sufficient number
of negative noise peaks is included in the mask such that their
distribution in parameter space can be studied meaningfully. Users
can select to produce diagnostic plots on the reliability calculation
similar to those shown in Serra, Jurek & Flöer (2012). The black
contours in the bottom panels of Fig. 4 highlight objects whose reliability
is higher than 99 per cent.

In summary, users can decide to run SoFiA with a high detection
threshold, resulting in a reliable but possibly incomplete catalogue
of detections; but they can also decide to dig deeper into the
noise using a lower threshold, and subsequently remove false detections.
In the latter case, a reliability value can be returned for all
positive detections.


2.5 Mask optimization 

SoFiA measures the parameters of all sources (e.g., total flux, size,
line width) considering only voxels included in the mask cube.
However, experience shows that masks can miss the faint, outer
edge of objects, in particular if obtained with a high detection
threshold. This would introduce systematic effects in the measured
parameters (e.g., the total flux would be underestimated; see Westmeier,
Popping & Serra 2012). To prevent this, SoFiA offers two
mask optimization methods which modify the object mask cube by
growing the masks which define individual objects. In both methods,
the mask is grown independently for each object.

The first method is mostly appropriate for sources that are unresolved
on the sky or, if resolved, face-on and symmetric. It starts
by fitting an ellipse to the moment-0 image of the object. The ellipse
is then used as a mask for all velocity channels occupied by
the object - i.e., the initial mask, which generally has an arbitrary
3D shape, is converted into an elliptic cylinder. Finally, the size of
the ellipse is increased until a maximum in total flux is reached (a
similar method is described by Barden et al. 2012 in the context of
2D imaging).

The above method should in principle be applied only to
sources which fill most of the cylindrical mask in all channels (see
above), while for objects with a more complex 3D structure it can
result in a decrease of the integrated signal-to-noise ratio. For this
reason we provide a second mask growth method. This consists of
performing a binary dilation of the initial mask along the two spatial
axes of the data cube using a 2D dilation structuring element
whose shape approximates a circle. The size of the structuring element
is increased iteratively until the total flux converges (i.e., until
the relative flux growth between successive iterations is lower than
a threshold specified by the user). This method preserves the 3D
shape of the initial mask. In addition to growth along the two spatial
axes, this algorithm can also grow all masks by a fixed number
of channels (selected by the user) along the frequency axis.
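
The dilation-based mask growth can be sketched with SciPy as follows. This is an illustration under stated assumptions rather than SoFiA's implementation: the names `disk` and `grow_mask`, the default threshold, the maximum radius and the choice to return the last mask before convergence are all made up for this example, and the optional growth along the frequency axis is omitted.

```python
import numpy as np
from scipy import ndimage

def disk(radius):
    """2D boolean structuring element approximating a circle."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    return x ** 2 + y ** 2 <= radius ** 2

def grow_mask(cube, mask, rel_threshold=0.02, max_radius=10):
    """Dilate the mask of a single object along the two spatial axes
    until the total flux inside the mask converges.

    cube, mask : 3D arrays with axes (z, y, x); mask is boolean.
    """
    best = mask
    flux = cube[mask].sum()
    for radius in range(1, max_radius + 1):
        # Thickness 1 along z: each channel is dilated independently,
        # so the 3D shape of the initial mask is preserved.
        structure = disk(radius)[np.newaxis, :, :]
        grown = ndimage.binary_dilation(mask, structure=structure)
        new_flux = cube[grown].sum()
        if new_flux - flux < rel_threshold * abs(flux):
            break                      # relative growth below threshold
        best, flux = grown, new_flux
    return best
```

Each iteration dilates the initial mask with a larger structuring element, so the loop stops as soon as enlarging the element no longer adds a significant amount of flux.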
Figure 5. Data products for the brightest galaxy in the cube shown in Fig. 1. From left to right: moment-0 image; moment-1 velocity field; position-velocity
diagram along the morphological major axis (PA = 32°); and integrated spectrum (filled circles) with the best-fitting busy function overlaid (solid line). In the
latter panel the dotted and dashed lines indicate the line widths W50 and W20 estimated at 50 and 20 per cent of the peak flux, respectively, on the basis of the
busy function fit.


2.6 Source parameterization 

As mentioned in Sec. 2.3, basic source parameters are measured
when detected voxels are merged into objects. These can be used
to estimate the reliability of each source and reject false detections
in parameter space. After mask optimization SoFiA re-computes
those parameters and measures additional ones. These include: position
(both geometric and centre of mass); total flux; minimum
and maximum voxel value; size and bounding region along each
axis; line width measured using different methods (including the
one proposed by Courtois et al. 2009); results of an ellipse fit to
the moment-0 image; results of a busy function fit to the integrated
spectrum (for a description of the busy function see Westmeier et al.
2014). These parameters are provided both in a “raw” format (i.e.,
coordinates in pixel units, fluxes in data units, line width in channels)
as well as converted into more useful units (e.g., WCS coordinates
and standard flux and velocity units). Some of these parameterization
steps are optional and implementation of additional
parameters is straightforward within the code.
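
A few of the "raw" parameters listed above can be measured directly from the data cube and the object mask, as in the following sketch. It is not SoFiA's parameterization module: the function name `basic_parameters` and the dictionary keys are hypothetical, all quantities are in pixel, channel and data units, and the more involved measurements (line widths, ellipse and busy function fits) are left out.

```python
import numpy as np
from scipy import ndimage

def basic_parameters(cube, object_mask):
    """Measure raw parameters for every object in an integer object mask.

    cube, object_mask : 3D arrays with axes (z, y, x). Returns a list of
    dictionaries, one per object.
    """
    ids = np.unique(object_mask)
    ids = ids[ids > 0]
    boxes = ndimage.find_objects(object_mask)   # boxes[i] belongs to ID i+1
    catalogue = []
    for obj_id in ids:
        obj_id = int(obj_id)
        voxels = cube[object_mask == obj_id]
        box = boxes[obj_id - 1]
        catalogue.append({
            "id": obj_id,
            "total_flux": float(voxels.sum()),
            "min": float(voxels.min()),
            "max": float(voxels.max()),
            # Flux-weighted centroid (centre of mass) in voxel coordinates.
            "centroid_zyx": ndimage.center_of_mass(
                cube, labels=object_mask, index=obj_id),
            # Bounding region along each axis as (start, stop) indices.
            "bbox_zyx": [(s.start, s.stop) for s in box],
            "n_channels": box[0].stop - box[0].start,
        })
    return catalogue
```

Converting these values into WCS coordinates and physical units would require the header information of the cube, which is outside the scope of this sketch.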


2.7 Output products 

Users can decide what output SoFiA should produce. Available 
output products include: 

• Catalogue of objects and their parameters, both in ASCII and
VO-compliant XML format.

• Final object mask. 

• Moment 0 and 1 images of the sky area covered by the full 
data cube determined from the data within the mask. 

• Cut-out data cubes containing individual objects as well as
their corresponding mask, moment 0, 1 and 2 images, integrated
spectrum and position-velocity diagram along the morphological
major axis. An example of these products is shown in Fig. 5.

In future releases it will be possible to produce these products for 
just a subset of the detections by selecting a region of interest in 
source parameter space. 
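
For reference, the moment 0 and moment 1 images mentioned in the list of output products can be computed from the data within the mask as sketched below. The function name `moment_maps` and the (z, y, x) axis order are assumptions of this example, and the moment 0 image is left as a plain sum over channels, without multiplication by the channel width.

```python
import numpy as np

def moment_maps(cube, mask, velocities):
    """Compute moment 0 and moment 1 images from masked data.

    cube, mask : 3D arrays with axes (z, y, x); mask is boolean or integer
    velocities : 1D array with the velocity of each channel
    """
    masked = np.where(mask, cube, 0.0)
    mom0 = masked.sum(axis=0)                       # integrated flux
    with np.errstate(invalid="ignore", divide="ignore"):
        # Intensity-weighted mean velocity along each line of sight.
        mom1 = (masked * velocities[:, None, None]).sum(axis=0) / mom0
    mom1[mom0 <= 0] = np.nan                        # undefined without flux
    return mom0, mom1
```

Pixels without positive integrated flux are set to NaN in the velocity field, since the intensity-weighted mean is undefined there.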

This output is designed to not only give useful information
about the detected sources but also to enable further, higher-level
analysis by the user. For example, the cut-out cubes of individual
objects and the corresponding masks could be used to measure additional
source parameters not included in SoFiA or to produce
Gauss-Hermite velocity fields to enable kinematical studies.


2.8 Performance of SoFiA 

In the current implementation of SoFiA the entire input data cube
(or the selected sub-cube; see Fig. 2) is loaded into memory and
processed on a single core. Additional cubes will also need to
be stored in memory at various stages of processing, such as the
weights cube, the binary or object mask cube, and a smoothed version
of the data cube if required by the source-finding algorithm
being used (e.g., S+C), plus a potentially large array of source parameters.
It is therefore interesting to discuss how the memory requirement
and execution time of SoFiA vary with cube size. For
this purpose we make use of two cubes. The smaller cube is the one
used for illustration purpose in this paper (Figs. 1 and 4). It has 360
pixels along both spatial axes and 150 channels along the frequency
axis, resulting in a file size of 78 MB. The second cube is the one
used for the source-finding tests of Serra, Jurek & Flöer (2012) and
Westmeier, Popping & Serra (2012). It too has 360 pixels along
both spatial axes but consists of 1464 channels along the frequency
axis. Therefore, its size is ~ 10 times that of the first cube.

We process the two cubes with identical settings employing 
a representative combination of the algorithms described in this 
paper: noise normalisation along the frequency axis; S+C source
finding with 12 smoothing kernels; merging of detected voxels 
into sources; calculation of reliability and removal of unreliable 
sources; optimization of the mask of individual objects using the 
dilation method; source parameterization including busy function 
fit; creation of output products for the cubes as a whole and for the 
individual detections; creation of ASCII and XML catalogues. The 
two runs are carried out on a machine running Linux Mint 17 with 
a memory of 16 GB and a 2.9 GHz Intel Core processor. 

Figure 6 shows the memory usage of SoFiA as a function of
time for the two cubes. Both axes of the plot are normalised by
the cube size. The time behaviour of the two curves appears very
similar, indicating that the execution time scales approximately linearly
with cube size within the range explored here. The memory
offset between the two curves is due to the loading into memory of
a number of libraries used by SoFiA. These come with a memory
overhead of the order of a few tens of megabytes, which is more
noticeable in the case of a smaller data cube. For data cubes much
larger than this overhead the memory usage is between 2 and 3
times the size of the cube, with occasional peaks between 3 and 4
times the cube size.
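
These numbers can be turned into a quick estimate of the memory needed for a given cube. The sketch below assumes 4-byte (32-bit floating point) voxels and 1 MB = 10^6 bytes, which reproduces the 78 MB quoted above for the smaller cube. The function names are hypothetical, and the library overhead is ignored, so the ranges apply only to cubes much larger than a few tens of megabytes.

```python
def cube_size_mb(nx, ny, nz, bytes_per_voxel=4):
    """Size of a data cube in MB, assuming 32-bit voxels by default."""
    return nx * ny * nz * bytes_per_voxel / 1e6

def typical_memory_mb(size_mb):
    """Typical memory usage: between 2 and 3 times the cube size."""
    return 2 * size_mb, 3 * size_mb

def peak_memory_mb(size_mb):
    """Occasional peaks: between 3 and 4 times the cube size."""
    return 3 * size_mb, 4 * size_mb

small = cube_size_mb(360, 360, 150)    # cube used for illustration
large = cube_size_mb(360, 360, 1464)   # cube used for the tests
print(f"small: {small:.1f} MB, large: {large:.1f} MB, "
      f"ratio: {large / small:.2f}")
print("typical memory for the large cube (MB):", typical_memory_mb(large))
```

The ratio of the two sizes is 1464 / 150, consistent with the factor of ~ 10 quoted in the text.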

Figure 6 allows us to investigate the memory and processing
time taken by the various algorithms. In this case, most of the time
is taken by the S+C finder. The beginning and end of its execution
are marked by open and filled black circles for the small and large
data cube, respectively. The peaks in memory usage of S+C correspond
to the smoothing operations, while the plateaus in between
peaks correspond to the noise level calculation. Each additional filter
would contribute another ~ 0.05 s/MB to the execution time
in this case. The noise calculation appears to be particularly time
consuming. In this case, it is carried out using the aforementioned
Gaussian fit to the negative side of the flux histogram (Sec. 2.1). This
calculation uses the full cube, and an obvious way to increase its
speed would be to calculate the noise level on a sub-cube. This option
may become available in the future.

Figure 6. Memory usage as a function of time for two runs of SoFiA on two cubes whose sizes differ by a factor ~ 10 (see legend). As explained in the text,
both runs include the use of the S+C source finder, and open and filled black circles indicate the beginning and the end of the S+C execution for the two cubes,
respectively. See Sec. 2.8 for a more detailed discussion of the time taken by other algorithms during the two SoFiA runs.

The time before the beginning of the S+C finder in Fig. 6 is
taken by the noise normalisation along the frequency axis and by 
an initial measurement of the noise level in the normalised cube. 
The time after S+C is taken by all other algorithms listed above, 
and these are typically much faster. The memory peak right at the 
end of S+C corresponds to the merging of voxels into sources. The 
height of this peak depends on the number of sources detected. This 
is followed by the calculation of the reliability and the rejection of 
unreliable sources, which are both relatively inexpensive in terms 
of memory but can be time consuming. The final memory peaks 
correspond to the creation of moment images. 


2.9 Source-finding based on a catalogue of 3D coordinates 

The above discussion makes it clear that SoFiA is currently not 
able to process arbitrarily large data cubes but is limited by the 
memory of the system on which it is run. This problem is partially 
alleviated by the fact that SoFiA is able to limit the processing to 
a sub-cube whose boundaries are specified by the user. Therefore, 
users could choose to run SoFiA multiple times on sufficiently 
small portions of a large input cube, obtaining individual output 
products for each of them. They could then combine these products, 
creating for example a single mask or catalogue. In the future we 
may be able to offer such breaking up of a large input cube into 
sub-cubes - and the creation of final data products for the full cube 
- as a processing mode fully integrated with the other modules of 
SoFiA. 
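
The manual splitting described above could be organised along the lines of the following Python sketch, which only computes the pixel boundaries of the sub-cubes. The function names and the use of an overlap between neighbouring sub-cubes are assumptions of this example, not features of SoFiA; the overlap is intended to ensure that a source close to a boundary is fully contained in at least one sub-cube.

```python
from itertools import product


def axis_ranges(length, size, overlap):
    """Split [0, length) into windows of at most `size` pixels.

    Consecutive windows share `overlap` pixels. Returns (start, stop) pairs
    with `stop` exclusive.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    ranges = []
    start = 0
    while True:
        stop = min(start + size, length)
        ranges.append((start, stop))
        if stop == length:
            break
        start += step
    return ranges


def subcube_grid(shape, size, overlap):
    """Return the bounds of all sub-cubes, one (start, stop) pair per axis."""
    per_axis = [axis_ranges(n, s, o) for n, s, o in zip(shape, size, overlap)]
    return list(product(*per_axis))


# Example: a 100 x 100 x 50 cube split into 40 x 40 x 50 pieces that overlap
# by 10 pixels along the first two axes.
grid = subcube_grid((100, 100, 50), (40, 40, 50), (10, 10, 0))
```

Each entry of `grid` could then be passed to a separate run as the sub-cube boundaries. When the resulting catalogues are combined, sources detected twice in the overlap regions would need to be identified and removed.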

In this context, a useful feature already available in SoFiA is 
that it allows users to search for emission in any number of small 
sub-cubes centred at a set of 3D coordinates within an arbitrarily 
large data cube. For example, in an era of large HI and optical 
redshift surveys, this mode could be used to look for emission in 
a large HI cube at the location of galaxies included in an optical 
spectroscopic catalogue. 

This mode is fully integrated in SoFiA, and interested users 
simply need to provide the input data cube and a catalogue of 3D 
coordinates. SoFiA will process the various positions sequentially, 
each time loading into memory only the sub-cube of interest. The 
3D size of the sub-cubes can be set by the user and is the same for 
all positions. Users can also request the creation of a single output 
catalogue of sources, which is generated by merging the catalogues 
obtained at each position. 
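
The logic of this mode can be sketched in Python as follows. This is an illustration under stated assumptions rather than SoFiA's code: `subcube_bounds`, `search_at_positions` and the `find_sources` callback are hypothetical names, positions are assumed to be given in pixel coordinates, and the sub-cube is assumed to be a box with a fixed half-size per axis that is clipped at the edges of the cube.

```python
def subcube_bounds(centre, half_size, shape):
    """Bounds (start, stop) per axis of a box centred on `centre`.

    The box extends `half_size` pixels on either side of the centre and is
    clipped to the cube; `stop` is exclusive.
    """
    bounds = []
    for c, h, n in zip(centre, half_size, shape):
        pixel = int(round(c))
        bounds.append((max(pixel - h, 0), min(pixel + h + 1, n)))
    return tuple(bounds)


def search_at_positions(positions, half_size, shape, find_sources):
    """Process the positions sequentially and merge the resulting catalogues.

    `find_sources` is a hypothetical callback that loads only the requested
    sub-cube and returns a list of dictionaries, one per detected source.
    """
    merged = []
    for centre in positions:
        bounds = subcube_bounds(centre, half_size, shape)
        if any(start >= stop for start, stop in bounds):
            continue  # the position lies outside the cube
        for source in find_sources(bounds):
            # Record which sub-cube each source came from.
            merged.append(dict(source, subcube=bounds))
    return merged


# Example with a dummy finder that reports one source per sub-cube.
catalogue = search_at_positions(
    positions=[(5, 50, 98), (500, 50, 50)],
    half_size=(10, 10, 10),
    shape=(100, 100, 100),
    find_sources=lambda bounds: [{"id": 1}],
)
```

Only one sub-cube is handled at a time, so in this sketch the memory requirement is set by the chosen sub-cube size rather than by the size of the full cube.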


2.10 Comparison to other source finders 

A number of established software packages for the reduction and 
analysis of interferometric data allow some source finding to be 
carried out on data cubes (e.g., GIPSY, Miriad). However, this 
approach requires users to develop custom codes which make use 
of (and are limited by) the tasks available within those general- 
purpose packages. In contrast, the more specialised SoFiA offers 
a wide range of ready-to-use source-finding algorithms, which are 
already integrated with one another and can be combined in a 
flexible way to produce a variety of output products. 

The other 3D application for spectral line data which shares 
some of these characteristics is Duchamp (Whiting 2012). This 
application detects sources using a simple threshold method 
(similar to the one described in Sec. 2.2) and then grows them using a 
secondary threshold. This algorithm differs from those available in 
SoFiA and, in this respect, the two packages could be seen as 
complementary (Popping et al. 2012 show that Duchamp has the best 
performance for unresolved sources but does not reach the 
completeness of S+C for resolved sources). With respect to memory 
requirements, Duchamp is similar to SoFiA in that it loads the full 
input data cube and processes it on a single core. Therefore, it too 
is limited by the memory of the system on which it is run. 
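
The general idea of detecting with one threshold and growing with a second, lower one can be illustrated with the following Python sketch. It is not Duchamp's implementation: the function name, the representation of the data as a dictionary of voxel coordinates and the choice of face-connected neighbours are assumptions made for this example.

```python
from collections import deque


def threshold_and_grow(flux, primary, secondary):
    """Return the set of voxels detected by a threshold-and-grow scheme.

    `flux` maps integer voxel coordinates (tuples) to flux values. Voxels at
    or above `primary` are seeds; each detection is then grown to include
    connected neighbours at or above the lower `secondary` threshold.
    """
    seeds = [voxel for voxel, value in flux.items() if value >= primary]
    mask = set(seeds)
    queue = deque(seeds)
    while queue:
        voxel = queue.popleft()
        for axis in range(len(voxel)):
            for step in (-1, 1):
                neighbour = voxel[:axis] + (voxel[axis] + step,) + voxel[axis + 1:]
                if neighbour in mask:
                    continue
                if flux.get(neighbour, float("-inf")) >= secondary:
                    mask.add(neighbour)
                    queue.append(neighbour)
    return mask


# One-dimensional example: a single spectrum with one bright peak.
spectrum = [0.0, 1.0, 2.5, 6.0, 3.0, 0.5, 2.5, 0.0]
flux = {(i,): value for i, value in enumerate(spectrum)}
detected = threshold_and_grow(flux, primary=5.0, secondary=2.0)
```

In this example the channel with flux 2.5 at index 6 exceeds the secondary threshold but is not connected to a seed, so it is not included in the detection.
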
A significant advantage of SoFiA compared to Duchamp is 
that it offers a larger number of algorithms for both source finding 
and parameterization. These include the S+C and CNHI finders, 
the 2D-1D wavelet de-noising (de-noising is available in Duchamp, 
but it uses isotropic wavelets, which are not ideal for spectral line 
sources whose size along the spectral axis is decoupled from the 
size along the two spatial axes), the calculation of the reliability of 
individual detections, the mask optimization by binary dilation, the 
possibility of searching for signal on the basis of an input catalogue 
of 3D coordinates and the busy function fit. The creation of cubelets 
and PV diagrams for individual detections is also not included in 
Duchamp but is available in Selavy, a source finder built upon 
Duchamp for distributed processing of large cubes (Whiting & 
Humphreys 2012). While future development may reduce the 
difference between Duchamp/Selavy and our package, all of the above 
methods are at the moment unique to SoFiA. 

Finally, it is worth mentioning that SoFiA does not offer at 
the moment a full analysis of the sources' morphology. For example, 
a group of nearby detected voxels is merged into a single source 
regardless of the size of the source and only based on the merging 
element chosen by the user (Sec. 2.3). This means that no information 
is given about whether the source, which could be very large, 
is composed of distinct and easily recognisable components. 
Different and more specialised source finders are able to provide such 
characterisation (e.g., Clumpfind by Williams, de Geus & Blitz 
1994; BLOBCAT by Hales et al. 2012). We note however that the 
object mask cube returned by SoFiA could be used as a starting 
point for further morphological analysis of the detections. 
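
The behaviour described in this paragraph can be illustrated with a minimal Python sketch of merging detected voxels into sources. It is not SoFiA's merging code: the function name is hypothetical, and the merging element is assumed here to be a box with one linking radius per axis. The sketch uses a simple quadratic-time search, which is adequate only for small examples.

```python
from collections import deque


def merge_voxels(voxels, radius):
    """Group detected voxels into sources.

    Two voxels are linked if their separation along every axis does not
    exceed the corresponding entry of `radius`; a source is a set of voxels
    connected through such links, whatever its final size.
    """
    unassigned = set(voxels)
    sources = []
    while unassigned:
        seed = unassigned.pop()
        members = [seed]
        queue = deque([seed])
        while queue:
            current = queue.popleft()
            linked = [
                v for v in unassigned
                if all(abs(a - b) <= r for a, b, r in zip(v, current, radius))
            ]
            for v in linked:
                unassigned.remove(v)
                members.append(v)
                queue.append(v)
        sources.append(sorted(members))
    return sorted(sources)


# Three voxels chained together by the linking radii, plus one isolated voxel.
detections = [(0, 0, 0), (1, 0, 0), (1, 1, 2), (10, 10, 10)]
sources = merge_voxels(detections, radius=(1, 1, 2))
```

Because membership depends only on the links between neighbouring voxels, a chain of detections can grow into an arbitrarily large source without any information on whether it consists of distinct components.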


3 SUMMARY 

We provide a high-level description of SoFiA, a flexible source 
finder for 3D spectral line data. SoFiA puts together for the first 
time in a single package a number of new source-finding and 
parameterization algorithms developed in preparation for upcoming 
HI surveys with ASKAP (WALLABY, DINGO) and APERTIF. It 
is, however, designed to enable the use of these new algorithms on 
any data cube independent of emission line or telescope used. 

We describe the various methods and algorithms available in 
SoFiA as well as planned developments. One key advantage of 
SoFiA is that it allows users to search for spectral line signal on 
multiple scales on the sky and in frequency (using, e.g., the S+C 
finder or the 2D-1D wavelet filter), which is crucial to detect and 
parameterize 3D sources in a complete and reliable way. 
Furthermore, within SoFiA it is possible to take into account noise level 
variations across the cube and the presence of errors and artefacts. 
Moreover, SoFiA is able to estimate the reliability of individual 
detections, which should be particularly useful for surveys expected 
to detect a large number of sources. It can also produce a variety 
of output products, including moment images, cut-out cubes and 
images, integrated spectra and catalogues of source parameters. 
Finally, SoFiA is able to search for line emission in arbitrarily large 
data cubes on the basis of a catalogue of 3D coordinates. Most of 
these methods are not available in other source finders and are 
currently unique to SoFiA. 

We provide a few visual examples of how SoFiA works, 
including a view of the dedicated graphical user interface. We 
describe the available parameterization and the wide range of output 
products, which include mask cubes, moment images, 
position-velocity diagrams and busy function spectral fits of individual 
sources. This output is designed both to provide a useful 
description of the sources and to facilitate subsequent analysis. 

We highlight the modularity of SoFiA, which allows users to 
optimize the source-finding and parameterization strategy for the 
data and sources of interest. This modularity also enables future 
expansions of SoFiA to include new source-finding and 
parameterization algorithms. 

SoFiA is publicly available at the website indicated in Sec. 2, 
together with technical information on how to use the software. 
Software updates, improvements and bug fixes are posted regularly 
at this webpage. SoFiA is registered at the Astrophysics Source 
Code Library with ID ascl:1412.001. 